Analysis of FBI Crime and Officer Data

Megane Crenshaw

Introduction

With the progress that has been made in technology and machine learning in the past decade, data science has become a very popular and in demand profession. This tutorial aims to give an introduction to data science by exploring and analyzing data from the FBI. The data we will be using comes from the Uniform Crime Reporting (UCR) program. In this program, law enforcement agencies across the nation provide data to the FBI, who then publishes it. For more information on this program you can visit https://www.fbi.gov/services/cjis/ucr. We will be using data from 2007 to 2017 to see if there is a relationship between number of officers and crime rate in order to hopefully inform policy and decision making in law enforcement.

For this tutorial you will be using the pandas library to read and organize the data, and matplotlib.pyplot to graph the data. For exploring and analyzing the data we will also be using SciKit-Learn. This tutorial assumes prior knowledge.

Collect Data

This is the first step in the data lifecycle. Here we just gather the data through methods like web scraping. Fortunately, the FBI provides Excel tables that we can download, so we don't need to scrape the web. To download these tables yourself, visit https://ucr.fbi.gov/crime-in-the-u.s/, click on the year you want data for, and then navigate to the violent crimes tables and police employee tables. We will be using table 4 for crime data and table 71 for officer data (the data is organized differently for 2016, so for 2016 we will be using tables 2 and 26). To read the tables use the pandas function read_excel().

In [1]:
import pandas
import re
import matplotlib.pyplot as plt
from sklearn import linear_model
import numpy as np
In [2]:
crime_2017 = pandas.read_excel("2017 Table 4.xls")
officer_2017 = pandas.read_excel("2017 Table 71.xls")

crime_2016 = pandas.read_excel("2016 Table 2.xls")
officer_2016 = pandas.read_excel("2016 Table 24.xls")

crime_2015 = pandas.read_excel("2015 Table 4.xls")
officer_2015 = pandas.read_excel("2015 Table 71.xls")

crime_2014 = pandas.read_excel("2014 Table 4.xls")
officer_2014 = pandas.read_excel("2014 Table 71.xls")

crime_2013 = pandas.read_excel("2013 Table 4.xls")
officer_2013 = pandas.read_excel("2013 Table 71.xls")

crime_2012 = pandas.read_excel("2012 Table 4.xls")
officer_2012 = pandas.read_excel("2012 Table 71.xls")

crime_2011 = pandas.read_excel("2011 Table 4.xls")
officer_2011 = pandas.read_excel("2011 Table 71.xls")

crime_2010 = pandas.read_excel("2010 Table 4.xls")
officer_2010 = pandas.read_excel("2010 Table 71.xls")

crime_2009 = pandas.read_excel("2009 Table 4.xls")
officer_2009 = pandas.read_excel("2009 Table 71.xls")

crime_2008 = pandas.read_excel("2008 Table 4.xls")
officer_2008 = pandas.read_excel("2008 Table 71.xls")

crime_2007 = pandas.read_excel("2007 Table 4.xls")
officer_2007 = pandas.read_excel("2007 Table 71.xls")

crime_2017.head(10)
Out[2]:
Table 4 Unnamed: 1 Unnamed: 2 Unnamed: 3 Unnamed: 4 Unnamed: 5 Unnamed: 6 Unnamed: 7 Unnamed: 8 Unnamed: 9 ... Unnamed: 14 Unnamed: 15 Unnamed: 16 Unnamed: 17 Unnamed: 18 Unnamed: 19 Unnamed: 20 Unnamed: 21 Unnamed: 22 Unnamed: 23
0 Crime in the United States1 NaN NaN NaN NaN ... NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN
1 by Region, Geographic Division, and State, 201... NaN NaN NaN NaN NaN NaN NaN ... NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN
2 Area Year Population2 Violent crime3 NaN Murder and \nnonnegligent \nmanslaughter NaN Rape\n(revised definition)4 NaN Robbery ... NaN Burglary NaN Larceny-theft NaN Motor vehicle theft NaN NaN NaN NaN
3 NaN NaN NaN Rate per \n100,000 NaN Rate per \n100,000 NaN Rate per \n100,000 NaN ... Rate per \n100,000 NaN Rate per \n100,000 NaN Rate per \n100,000 NaN Rate per \n100,000 NaN NaN
4 United States Total5, 6 2016 323405935 1285606 397.5 17413 5.4 132414 40.9 332797 ... 2451.6 1516405 468.9 5644835 1745.4 767290 237.3 NaN NaN
5 NaN 2017 325719178 1283220 394 17284 5.3 135755 41.7 319356 ... 2362.2 1401840 430.4 5519107 1694.4 773139 237.4 NaN NaN NaN
6 NaN Percent change NaN -0.2 -0.9 -0.7 -1.4 2.5 1.8 -4 ... -3.6 -7.6 -8.2 -2.2 -2.9 0.8 * NaN NaN NaN
7 Northeast 2016 56359360 178406 316.6 1977 3.5 16763 29.7 52958 ... 1610.7 142092 252.1 708772 1257.6 56917 101 NaN NaN
8 NaN 2017 56470581 172042 304.7 1957 3.5 16863 29.9 48714 ... 1558.6 127908 226.5 695447 1231.5 56799 100.6 NaN NaN NaN
9 NaN Percent change NaN -3.6 -3.8 -1 -1.2 0.6 0.4 -8 ... -3.2 -10 -10.2 -1.9 -2.1 -0.2 -0.4 NaN NaN

10 rows × 24 columns

Data Processing

So now we have all our data, but as you can see it is messy. In the data processing step of the lifecycle, we organize and "fix" the data so that it is readable and easy to use in the next steps, visualization and analysis. For this data, to "fix" it we will rename the columns, and change the index to be a MultiIndex. In addition, we will only be looking at data per region, so we can drop all the rows for individual states. For more information on what tidy data should look like, visit https://cran.r-project.org/web/packages/tidyr/vignettes/tidy-data.html.

The data for officers only has data per region, so we will remove the data for individual states from the crime tables since we will not be using it.

In [3]:
#remove empty columns
crime_2017_fixed = crime_2017.drop(["Unnamed: 21", "Unnamed: 22", "Unnamed: 23"], axis = 1)

#rename columns 
crime_2017_fixed.columns = ["Area", "Year", "Population", "Violent Crime", "Violent Crime Rate per 100,000", 
                      "Murder and Nonnegligent Manslaughter", "Muderer Rate", "Rape Revised Definition", 
                      "Rape (Revised) Rate", "Robbery", "Robbery Rate", "Aggravated Assault", "Aggravated Assault Rate",
                      "Property Crime", "Property Crime Rate", "Burglary", "Burglary Rate", "Larceny-Theft", 
                      "Larceny-Theft Rate", "Motor Vehicle Theft", "Motor Vehicle Theft Rate"]

#remove empty rows 
crime_2017_fixed = crime_2017_fixed.drop([0,1,2,3,202,203,204,205,206,207,208,209])

# we are including DC and Puerto Rico because they are not regions are therefore not in the officer tables, so we
# will not be using them
states = ["Alabama","Alaska","Arizona","Arkansas","California","Colorado",
  "Connecticut","Delaware","Florida","Georgia","Hawaii","Idaho","Illinois",
  "Indiana","Iowa","Kansas","Kentucky","Louisiana","Maine","Maryland",
  "Massachusetts","Michigan","Minnesota","Mississippi","Missouri","Montana",
  "Nebraska","Nevada","New Hampshire","New Jersey","New Mexico","New York",
  "North Carolina","North Dakota","Ohio","Oklahoma","Oregon","Pennsylvania",
  "Rhode Island","South Carolina","South Dakota","Tennessee","Texas","Utah",
  "Vermont","Virginia","Washington","West Virginia","Wisconsin","Wyoming", "District of Columbia", "Puerto Rico"]

to_delete = []

#remove states from the data
for index, row in crime_2017_fixed.iterrows():
    area_name = str(row["Area"]).strip()
    area_name = re.sub(r'\d+', '', area_name)
    if area_name in states:
        to_delete.extend([index,index+1, index+2])
    
crime_2017_fixed = crime_2017_fixed.drop(to_delete)  

#change the index to a multi-index with 2 levels, one for region and one for year
idx = pandas.MultiIndex.from_product([['United States Total', 'Northeast', 'New England', 'Middle Atlantic', 
                                   'Midwest', 'East North Central', 'West North Central', 'South', 
                                   'South Atlantic', 'East South Central', 
                                   'West South Central', 'West', 'Mountain', 'Pacific'],
                                  ['2016', '2017', 'Percent Change']],
                                 names=['Area', 'Year'])

crime_2017_fixed.index = idx

#drop the area and year columns since they are now unnecessary
crime_2017_fixed = crime_2017_fixed.drop(["Area", "Year"], axis = 1)
crime_2017_fixed.head(10)
Out[3]:
Population Violent Crime Violent Crime Rate per 100,000 Murder and Nonnegligent Manslaughter Muderer Rate Rape Revised Definition Rape (Revised) Rate Robbery Robbery Rate Aggravated Assault Aggravated Assault Rate Property Crime Property Crime Rate Burglary Burglary Rate Larceny-Theft Larceny-Theft Rate Motor Vehicle Theft Motor Vehicle Theft Rate
Area Year
United States Total 2016 323405935 1285606 397.5 17413 5.4 132414 40.9 332797 102.9 802982 248.3 7928530 2451.6 1516405 468.9 5644835 1745.4 767290 237.3
2017 325719178 1283220 394 17284 5.3 135755 41.7 319356 98 810825 248.9 7694086 2362.2 1401840 430.4 5519107 1694.4 773139 237.4
Percent Change NaN -0.2 -0.9 -0.7 -1.4 2.5 1.8 -4 -4.7 1 0.3 -3 -3.6 -7.6 -8.2 -2.2 -2.9 0.8 *
Northeast 2016 56359360 178406 316.6 1977 3.5 16763 29.7 52958 94 106708 189.3 907781 1610.7 142092 252.1 708772 1257.6 56917 101
2017 56470581 172042 304.7 1957 3.5 16863 29.9 48714 86.3 104508 185.1 880154 1558.6 127908 226.5 695447 1231.5 56799 100.6
Percent Change NaN -3.6 -3.8 -1 -1.2 0.6 0.4 -8 -8.2 -2.1 -2.3 -3 -3.2 -10 -10.2 -1.9 -2.1 -0.2 -0.4
New England 2016 14757573 41841 283.5 296 2 4569 31 9394 63.7 27582 186.9 242151 1640.9 41835 283.5 182176 1234.5 18140 122.9
2017 14810001 40519 273.6 346 2.3 4833 32.6 8918 60.2 26422 178.4 228293 1541.5 36958 249.5 173101 1168.8 18234 123.1
Percent Change -3.2 -3.5 16.9 16.5 5.8 5.4 -5.1 -5.4 -4.2 -4.5 -5.7 -6.1 -11.7 -12 -5 -5.3 0.5 0.2
Middle Atlantic 2016 41601787 136565 328.3 1681 4 12194 29.3 43564 104.7 79126 190.2 665630 1600 100257 241 526596 1265.8 38777 93.2

The data looks much better and much more readable. Now we do this for the rest of the tables.

In [4]:
crime_2016_fixed = crime_2016.drop(["Unnamed: 9", "Unnamed: 10"], axis = 1)
crime_2016_fixed.columns = ["Area", "Year", "Population", "Violent Crime", "Violent Crime Rate per 100,000", 
                      "Murder and Nonnegligent Manslaughter", "Muderer Rate", "Rape Revised Definition", 
                      "Rape (Revised) Rate", "Robbery", "Robbery Rate", "Aggravated Assault", "Aggravated Assault Rate",
                      "Property Crime", "Property Crime Rate", "Burglary", "Burglary Rate", "Larceny-Theft", 
                      "Larceny-Theft Rate", "Motor Vehicle Theft", "Motor Vehicle Theft Rate"]

crime_2016_fixed = crime_2016_fixed.drop([0,1,2,3,202,203,204,205,206,207,208,209,210,211,212])

to_delete = []

for index, row in crime_2016_fixed.iterrows():
    area_name = str(row["Area"]).strip()
    area_name = re.sub(r'\d+', '', area_name)
    if area_name in states:
        to_delete.extend([index,index+1, index+2])
    
crime_2016_fixed = crime_2016_fixed.drop(to_delete)  

idx = pandas.MultiIndex.from_product([['United States Total', 'Northeast', 'New England', 'Middle Atlantic', 
                                   'Midwest', 'East North Central', 'West North Central', 'South', 
                                   'South Atlantic', 'East South Central', 
                                   'West South Central', 'West', 'Mountain', 'Pacific'],
                                      ['2015', '2016', 'Percent Change']],
                                     names=['Area', 'Year'])

crime_2016_fixed.index = idx

crime_2016_fixed = crime_2016_fixed.drop(["Area", "Year"], axis = 1)
In [5]:
crime_2015_fixed = crime_2015.drop(["Unnamed: 23", "Unnamed: 24", "Unnamed: 25", "Unnamed: 9", "Unnamed: 10"], axis = 1)
crime_2015_fixed.columns = ["Area", "Year", "Population", "Violent Crime", "Violent Crime Rate per 100,000", 
                      "Murder and Nonnegligent Manslaughter", "Muderer Rate", "Rape Revised Definition", 
                      "Rape (Revised) Rate", "Robbery", "Robbery Rate", "Aggravated Assault", "Aggravated Assault Rate",
                      "Property Crime", "Property Crime Rate", "Burglary", "Burglary Rate", "Larceny-Theft", 
                      "Larceny-Theft Rate", "Motor Vehicle Theft", "Motor Vehicle Theft Rate"]

crime_2015_fixed = crime_2015_fixed.drop([0,1,2,3,202,203,204,205,206,207,208,209,210,211,212, 213])

to_delete = []

for index, row in crime_2015_fixed.iterrows():
    area_name = str(row["Area"]).replace(",", "").strip()
    area_name = re.sub(r'\d+', '', area_name).strip()
   
    if area_name in states:
        to_delete.extend([index,index+1, index+2])
    
crime_2015_fixed = crime_2015_fixed.drop(to_delete)  

idx = pandas.MultiIndex.from_product([['United States Total', 'Northeast', 'New England', 'Middle Atlantic', 
                                   'Midwest', 'East North Central', 'West North Central', 'South', 
                                   'South Atlantic', 'East South Central', 
                                   'West South Central', 'West', 'Mountain', 'Pacific'],
                                      ['2014', '2015', 'Percent Change']],
                                     names=['Area', 'Year'])

crime_2015_fixed.index = idx

crime_2015_fixed = crime_2015_fixed.drop(["Area", "Year"], axis = 1)
In [6]:
crime_2014_fixed = crime_2014.drop(["Unnamed: 23", "Unnamed: 24", "Unnamed: 25", "Unnamed: 9", "Unnamed: 10"], axis = 1)
crime_2014_fixed.columns = ["Area", "Year", "Population", "Violent Crime", "Violent Crime Rate per 100,000", 
                      "Murder and Nonnegligent Manslaughter", "Muderer Rate", "Rape Revised Definition", 
                      "Rape (Revised) Rate", "Robbery", "Robbery Rate", "Aggravated Assault", "Aggravated Assault Rate",
                      "Property Crime", "Property Crime Rate", "Burglary", "Burglary Rate", "Larceny-Theft", 
                      "Larceny-Theft Rate", "Motor Vehicle Theft", "Motor Vehicle Theft Rate"]

crime_2014_fixed = crime_2014_fixed.drop([0,1,2,3,4,203,204,205,206,207,208,209,210,211,212, 213])

to_delete = []

for index, row in crime_2014_fixed.iterrows():
    area_name = str(row["Area"]).replace(",", "").strip()
    area_name = re.sub(r'\d+', '', area_name).strip()
   
    if area_name in states:
        to_delete.extend([index,index+1, index+2])
    
crime_2014_fixed = crime_2014_fixed.drop(to_delete)  

idx = pandas.MultiIndex.from_product([['United States Total', 'Northeast', 'New England', 'Middle Atlantic', 
                                   'Midwest', 'East North Central', 'West North Central', 'South', 
                                   'South Atlantic', 'East South Central', 
                                   'West South Central', 'West', 'Mountain', 'Pacific'],
                                      ['2013', '2014', 'Percent Change']],
                                     names=['Area', 'Year'])

crime_2014_fixed.index = idx

crime_2014_fixed = crime_2014_fixed.drop(["Area", "Year"], axis = 1)
In [7]:
crime_2013_fixed = crime_2013.drop(["Unnamed: 23", "Unnamed: 24", "Unnamed: 25", "Unnamed: 9", "Unnamed: 10"], axis = 1)
crime_2013_fixed.columns = ["Area", "Year", "Population", "Violent Crime", "Violent Crime Rate per 100,000", 
                      "Murder and Nonnegligent Manslaughter", "Muderer Rate", "Rape Revised Definition", 
                      "Rape (Revised) Rate", "Robbery", "Robbery Rate", "Aggravated Assault", "Aggravated Assault Rate",
                      "Property Crime", "Property Crime Rate", "Burglary", "Burglary Rate", "Larceny-Theft", 
                      "Larceny-Theft Rate", "Motor Vehicle Theft", "Motor Vehicle Theft Rate"]

crime_2013_fixed = crime_2013_fixed.drop([0,1,2,3,4,203,204,205,206,207,208,209,210,211,212,213])

to_delete = []

for index, row in crime_2013_fixed.iterrows():
    area_name = str(row["Area"]).replace(",", "").strip()
    area_name = re.sub(r'\d+', '', area_name).strip()
   
    if area_name in states:
        to_delete.extend([index,index+1, index+2])
    
crime_2013_fixed = crime_2013_fixed.drop(to_delete)  

idx = pandas.MultiIndex.from_product([['United States Total', 'Northeast', 'New England', 'Middle Atlantic', 
                                   'Midwest', 'East North Central', 'West North Central', 'South', 
                                   'South Atlantic', 'East South Central', 
                                   'West South Central', 'West', 'Mountain', 'Pacific'],
                                      ['2012', '2013', 'Percent Change']],
                                     names=['Area', 'Year'])

crime_2013_fixed.index = idx

crime_2013_fixed = crime_2013_fixed.drop(["Area", "Year"], axis = 1)
In [8]:
crime_2012_fixed = crime_2012.drop(["Unnamed: 21"], axis = 1)
crime_2012_fixed.columns = ["Area", "Year", "Population", "Violent Crime", "Violent Crime Rate per 100,000", 
                      "Murder and Nonnegligent Manslaughter", "Muderer Rate", "Rape Revised Definition", 
                      "Rape (Revised) Rate", "Robbery", "Robbery Rate", "Aggravated Assault", "Aggravated Assault Rate",
                      "Property Crime", "Property Crime Rate", "Burglary", "Burglary Rate", "Larceny-Theft", 
                      "Larceny-Theft Rate", "Motor Vehicle Theft", "Motor Vehicle Theft Rate"]

crime_2012_fixed = crime_2012_fixed.drop([0,1,2,3,4,203,204,205,206,207,208,209])

to_delete = []

for index, row in crime_2012_fixed.iterrows():
    area_name = str(row["Area"]).replace(",", "").strip()
    area_name = re.sub(r'\d+', '', area_name).strip()
   
    if area_name in states:
        to_delete.extend([index,index+1, index+2])
    
crime_2012_fixed = crime_2012_fixed.drop(to_delete)  

idx = pandas.MultiIndex.from_product([['United States Total', 'Northeast', 'New England', 'Middle Atlantic', 
                                   'Midwest', 'East North Central', 'West North Central', 'South', 
                                   'South Atlantic', 'East South Central', 
                                   'West South Central', 'West', 'Mountain', 'Pacific'],
                                      ['2011', '2012', 'Percent Change']],
                                     names=['Area', 'Year'])

crime_2012_fixed.index = idx

crime_2012_fixed = crime_2012_fixed.drop(["Area", "Year"], axis = 1)
In [9]:
crime_2011_fixed = crime_2011.drop(["Unnamed: 21", "Unnamed: 22"], axis = 1)
crime_2011_fixed.columns = ["Area", "Year", "Population", "Violent Crime", "Violent Crime Rate per 100,000", 
                      "Murder and Nonnegligent Manslaughter", "Muderer Rate", "Rape Revised Definition", 
                      "Rape (Revised) Rate", "Robbery", "Robbery Rate", "Aggravated Assault", "Aggravated Assault Rate",
                      "Property Crime", "Property Crime Rate", "Burglary", "Burglary Rate", "Larceny-Theft", 
                      "Larceny-Theft Rate", "Motor Vehicle Theft", "Motor Vehicle Theft Rate"]

crime_2011_fixed = crime_2011_fixed.drop([0,1,2,3,202,203,204,205,206,207,208])

to_delete = []

for index, row in crime_2011_fixed.iterrows():
    area_name = str(row["Area"]).replace(",", "").strip()
    area_name = re.sub(r'\d+', '', area_name).strip()
    
    if area_name in states:
        to_delete.extend([index,index+1, index+2])
    
crime_2011_fixed = crime_2011_fixed.drop(to_delete)  

idx = pandas.MultiIndex.from_product([['United States Total', 'Northeast', 'New England', 'Middle Atlantic', 
                                   'Midwest', 'East North Central', 'West North Central', 'South', 
                                   'South Atlantic', 'East South Central', 
                                   'West South Central', 'West', 'Mountain', 'Pacific'],
                                      ['2010', '2011', 'Percent Change']],
                                     names=['Area', 'Year'])

crime_2011_fixed.index = idx

crime_2011_fixed = crime_2011_fixed.drop(["Area", "Year"], axis = 1)
In [10]:
crime_2010_fixed = crime_2010.drop(["Unnamed: 21", "Unnamed: 22", "Unnamed: 23"], axis = 1)
crime_2010_fixed.columns = ["Area", "Year", "Population", "Violent Crime", "Violent Crime Rate per 100,000", 
                      "Murder and Nonnegligent Manslaughter", "Muderer Rate", "Rape Revised Definition", 
                      "Rape (Revised) Rate", "Robbery", "Robbery Rate", "Aggravated Assault", "Aggravated Assault Rate",
                      "Property Crime", "Property Crime Rate", "Burglary", "Burglary Rate", "Larceny-Theft", 
                      "Larceny-Theft Rate", "Motor Vehicle Theft", "Motor Vehicle Theft Rate"]

crime_2010_fixed = crime_2010_fixed.drop([0,1,2,3,199,200,201,202,203,204])

to_delete = []

for index, row in crime_2010_fixed.iterrows():
    area_name = str(row["Area"]).replace(",", "").strip()
    area_name = re.sub(r'\d+', '', area_name).strip()
   
    if area_name in states:
        to_delete.extend([index,index+1, index+2])
    
crime_2010_fixed = crime_2010_fixed.drop(to_delete)  

idx = pandas.MultiIndex.from_product([['United States Total', 'Northeast', 'New England', 'Middle Atlantic', 
                                   'Midwest', 'East North Central', 'West North Central', 'South', 
                                   'South Atlantic', 'East South Central', 
                                   'West South Central', 'West', 'Mountain', 'Pacific'],
                                      ['2009', '2010', 'Percent Change']],
                                     names=['Area', 'Year'])

crime_2010_fixed.index = idx

crime_2010_fixed = crime_2010_fixed.drop(["Area", "Year"], axis = 1)
In [11]:
crime_2009_fixed = crime_2009.drop(["Unnamed: 21", "Unnamed: 22", "Unnamed: 23"], axis = 1)
crime_2009_fixed.columns = ["Area", "Year", "Population", "Violent Crime", "Violent Crime Rate per 100,000", 
                      "Murder and Nonnegligent Manslaughter", "Muderer Rate", "Rape Revised Definition", 
                      "Rape (Revised) Rate", "Robbery", "Robbery Rate", "Aggravated Assault", "Aggravated Assault Rate",
                      "Property Crime", "Property Crime Rate", "Burglary", "Burglary Rate", "Larceny-Theft", 
                      "Larceny-Theft Rate", "Motor Vehicle Theft", "Motor Vehicle Theft Rate"]

crime_2009_fixed = crime_2009_fixed.drop([0,1,2,3,4,203,204,205,206,207,208,209,210,211,212,213,214,215])

to_delete = []

for index, row in crime_2009_fixed.iterrows():
    area_name = str(row["Area"]).replace(",", "").strip()
    area_name = re.sub(r'\d+', '', area_name).strip()
   
    if area_name in states:
        to_delete.extend([index,index+1, index+2])
    
crime_2009_fixed = crime_2009_fixed.drop(to_delete)  

idx = pandas.MultiIndex.from_product([['United States Total', 'Northeast', 'New England', 'Middle Atlantic', 
                                   'Midwest', 'East North Central', 'West North Central', 'South', 
                                   'South Atlantic', 'East South Central', 
                                   'West South Central', 'West', 'Mountain', 'Pacific'],
                                      ['2008', '2009', 'Percent Change']],
                                     names=['Area', 'Year'])

crime_2009_fixed.index = idx

crime_2009_fixed = crime_2009_fixed.drop(["Area", "Year"], axis = 1)
In [12]:
crime_2008_fixed = crime_2008.drop(["Unnamed: 21", "Unnamed: 22", "Unnamed: 23"], axis = 1)
crime_2008_fixed.columns = ["Area", "Year", "Population", "Violent Crime", "Violent Crime Rate per 100,000", 
                      "Murder and Nonnegligent Manslaughter", "Muderer Rate", "Rape Revised Definition", 
                      "Rape (Revised) Rate", "Robbery", "Robbery Rate", "Aggravated Assault", "Aggravated Assault Rate",
                      "Property Crime", "Property Crime Rate", "Burglary", "Burglary Rate", "Larceny-Theft", 
                      "Larceny-Theft Rate", "Motor Vehicle Theft", "Motor Vehicle Theft Rate"]

crime_2008_fixed = crime_2008_fixed.drop([0,1,2,3,4,203,204,205,206,207,208,209,210])

to_delete = []

for index, row in crime_2008_fixed.iterrows():
    area_name = str(row["Area"]).replace(",", "").strip()
    area_name = re.sub(r'\d+', '', area_name).strip()
   
    if area_name in states:
        to_delete.extend([index,index+1, index+2])
    
crime_2008_fixed = crime_2008_fixed.drop(to_delete)  

idx = pandas.MultiIndex.from_product([['United States Total', 'Northeast', 'New England', 'Middle Atlantic', 
                                   'Midwest', 'East North Central', 'West North Central', 'South', 
                                   'South Atlantic', 'East South Central', 
                                   'West South Central', 'West', 'Mountain', 'Pacific'],
                                      ['2007', '2008', 'Percent Change']],
                                     names=['Area', 'Year'])

crime_2008_fixed.index = idx

crime_2008_fixed = crime_2008_fixed.drop(["Area", "Year"], axis = 1)
In [13]:
crime_2007_fixed = crime_2007.drop(["Unnamed: 21", "Unnamed: 22", "Unnamed: 23"], axis = 1)
crime_2007_fixed.columns = ["Area", "Year", "Population", "Violent Crime", "Violent Crime Rate per 100,000", 
                      "Murder and Nonnegligent Manslaughter", "Muderer Rate", "Rape Revised Definition", 
                      "Rape (Revised) Rate", "Robbery", "Robbery Rate", "Aggravated Assault", "Aggravated Assault Rate",
                      "Property Crime", "Property Crime Rate", "Burglary", "Burglary Rate", "Larceny-Theft", 
                      "Larceny-Theft Rate", "Motor Vehicle Theft", "Motor Vehicle Theft Rate"]

crime_2007_fixed = crime_2007_fixed.drop([0,1,2,3,4,203,204,205,206,207,208,209,210,211,212,213,214,215,216])

to_delete = []

for index, row in crime_2007_fixed.iterrows():
    area_name = str(row["Area"]).replace(",", "").strip()
    area_name = re.sub(r'\d+', '', area_name).strip()
   
    if area_name in states:
        to_delete.extend([index,index+1, index+2])
    
crime_2007_fixed = crime_2007_fixed.drop(to_delete)  

idx = pandas.MultiIndex.from_product([['United States Total', 'Northeast', 'New England', 'Middle Atlantic', 
                                   'Midwest', 'East North Central', 'West North Central', 'South', 
                                   'South Atlantic', 'East South Central', 
                                   'West South Central', 'West', 'Mountain', 'Pacific'],
                                      ['2006', '2007', 'Percent Change']],
                                     names=['Area', 'Year'])

crime_2007_fixed.index = idx

crime_2007_fixed = crime_2007_fixed.drop(["Area", "Year"], axis = 1)
In [14]:
officer_2017_fixed = officer_2017.drop(["Unnamed: 9", "Unnamed: 10", "Unnamed: 11", "Unnamed: 12", "Unnamed: 13", "Unnamed: 14"], axis = 1)

columns = ["Region", "Number of Officers", "Total", "Group 1", "Group 2", "Group 3", "Group 4", "Group 5", 
           "Group 6"]

officer_2017_fixed.columns = columns

officer_2017_fixed = officer_2017_fixed.drop([0,1,2,3,32,33,34,35])

idx = pandas.MultiIndex.from_product([['United States Total', 'Northeast', 'New England', 'Middle Atlantic',
                                       'Midwest', 'East North Central', 'West North Central', 'South', 
                                       'South Atlantic', 'East South Central', 
                                       'West South Central', 'West', 'Mountain', 'Pacific'],
                                      ['Number of Officers', 'Average per 1,000']],
                                     names=['Region', 'Officers'])

officer_2017_fixed.index = idx
officer_2017_fixed = officer_2017_fixed.drop(["Region", "Number of Officers"], axis=1)

officer_2017_fixed.head(10)
Out[14]:
Total Group 1 Group 2 Group 3 Group 4 Group 5 Group 6
Region Officers
United States Total Number of Officers 422869 155692 47814 47850 45672 47866 77975
Average per 1,000 2.2 2.6 1.6 1.6 1.7 1.9 3.8
Northeast Number of Officers 119962 48868 7188 13201 15436 15893 19376
Average per 1,000 2.7 4.1 2.5 2 1.8 1.8 3
New England Number of Officers 28038 2205 3831 5409 5985 5628 4980
Average per 1,000 2.2 3.2 2.6 2 1.8 1.8 2.9
Middle Atlantic Number of Officers 91924 46663 3357 7792 9451 10265 14396
Average per 1,000 2.8 4.1 2.4 2 1.8 1.8 3.1
Midwest Number of Officers 82591 28201 6130 10896 10069 11343 15952
Average per 1,000 2.1 3 1.6 1.5 1.5 1.7 2.8
In [15]:
officer_2016_fixed = officer_2016.drop(["Unnamed: 9", "Unnamed: 10", "Unnamed: 11", "Unnamed: 12", "Unnamed: 13", "Unnamed: 14"], axis = 1)
officer_2016_fixed.columns = columns

officer_2016_fixed = officer_2016_fixed.drop([0,1,2,3,32,33])


officer_2016_fixed.index = idx
officer_2016_fixed = officer_2016_fixed.drop(["Region", "Number of Officers"], axis=1)
In [16]:
officer_2015_fixed = officer_2015.drop(["Unnamed: 9", "Unnamed: 10", "Unnamed: 11", "Unnamed: 12", "Unnamed: 13", "Unnamed: 14"], axis = 1)
officer_2015_fixed.columns = columns

officer_2015_fixed = officer_2015_fixed.drop([0,1,2,3,32,33])


officer_2015_fixed.index = idx
officer_2015_fixed = officer_2015_fixed.drop(["Region", "Number of Officers"], axis=1)
In [17]:
officer_2014_fixed = officer_2014.drop(["Unnamed: 9", "Unnamed: 10", "Unnamed: 11", "Unnamed: 12", "Unnamed: 13", "Unnamed: 14"], axis = 1)
officer_2014_fixed.columns = columns

officer_2014_fixed = officer_2014_fixed.drop([0,1,2,3,32,33])

officer_2014_fixed.index = idx
officer_2014_fixed = officer_2014_fixed.drop(["Region", "Number of Officers"], axis=1)
In [18]:
officer_2013_fixed = officer_2013.drop(["Unnamed: 9", "Unnamed: 10", "Unnamed: 11", "Unnamed: 12", "Unnamed: 13", "Unnamed: 14"], axis = 1)
officer_2013_fixed.columns = columns

officer_2013_fixed = officer_2013_fixed.drop([0,1,2,3,32,33,34,35,36,37])

officer_2013_fixed.index = idx
officer_2013_fixed = officer_2013_fixed.drop(["Region", "Number of Officers"], axis=1)
In [19]:
officer_2012_fixed = officer_2012.drop(["Unnamed: 9", "Unnamed: 10", "Unnamed: 11", "Unnamed: 12", "Unnamed: 13", "Unnamed: 14"], axis = 1)
officer_2012_fixed.columns = columns

officer_2012_fixed = officer_2012_fixed.drop([0,1,2,3,32,33,34,35,36,37])

officer_2012_fixed.index = idx
officer_2012_fixed = officer_2012_fixed.drop(["Region", "Number of Officers"], axis=1)
In [20]:
officer_2011_fixed = officer_2011.drop(["Unnamed: 9", "Unnamed: 10", "Unnamed: 11", "Unnamed: 12", "Unnamed: 13", "Unnamed: 14"], axis = 1)
officer_2011_fixed.columns = columns

officer_2011_fixed = officer_2011_fixed.drop([0,1,2,3,32,33,34,35,36,37])

officer_2011_fixed.index = idx
officer_2011_fixed = officer_2011_fixed.drop(["Region", "Number of Officers"], axis=1)
In [21]:
officer_2010_fixed = officer_2010.drop(["Unnamed: 9", "Unnamed: 10", "Unnamed: 11", "Unnamed: 12", "Unnamed: 13", "Unnamed: 14"], axis = 1)
officer_2010_fixed.columns = columns

officer_2010_fixed = officer_2010_fixed.drop([0,1,2,3,32,33,34,35,36,37])

officer_2010_fixed.index = idx
officer_2010_fixed = officer_2010_fixed.drop(["Region", "Number of Officers"], axis=1)
In [22]:
officer_2009_fixed = officer_2009.drop(["Unnamed: 9", "Unnamed: 10", "Unnamed: 11", "Unnamed: 12", "Unnamed: 13", "Unnamed: 14"], axis = 1)
officer_2009_fixed.columns = columns

officer_2009_fixed = officer_2009_fixed.drop([0,1,2,3,32,33,34,35,36,37])

officer_2009_fixed.index = idx
officer_2009_fixed = officer_2009_fixed.drop(["Region", "Number of Officers"], axis=1)
In [23]:
officer_2008_fixed = officer_2008.drop(["Unnamed: 9", "Unnamed: 10", "Unnamed: 11", "Unnamed: 12", "Unnamed: 13", "Unnamed: 14"], axis = 1)
officer_2008_fixed.columns = columns

officer_2008_fixed = officer_2008_fixed.drop([0,1,2,3,32,33,34,35,36,37,38])

officer_2008_fixed.index = idx
officer_2008_fixed = officer_2008_fixed.drop(["Region", "Number of Officers"], axis=1)
In [24]:
officer_2007_fixed = officer_2009.drop(["Unnamed: 9", "Unnamed: 10", "Unnamed: 11", "Unnamed: 12", "Unnamed: 13", "Unnamed: 14"], axis = 1)
officer_2007_fixed.columns = columns

officer_2007_fixed = officer_2007_fixed.drop([0,1,2,3,32,33,34,35,36,37])

officer_2007_fixed.index = idx
officer_2007_fixed = officer_2007_fixed.drop(["Region", "Number of Officers"], axis=1)

Our data is much cleaner now; however, we are not done. In the visualization step we want to plot things over time, so it would be helpful to reorganize the tables to be per region instead of per year to make plotting easier.

Merging two tables uses the index to line up the rows of the two tables, so we change the index on both the crime and officer tables to make sure the tables merge properly.

In [25]:
#gather the data for total crime in the US for each year
total_crime = pandas.DataFrame([crime_2017_fixed.iloc[1], crime_2016_fixed.iloc[1], crime_2015_fixed.iloc[1],
                                crime_2014_fixed.iloc[1], crime_2013_fixed.iloc[1], crime_2012_fixed.iloc[1], 
                                crime_2011_fixed.iloc[1], crime_2010_fixed.iloc[1], crime_2009_fixed.iloc[1], 
                                crime_2008_fixed.iloc[1], crime_2007_fixed.iloc[1], 
                               ])

#add a column for year and change the index back to default
years = ["2017","2016","2015","2014","2013","2012","2011","2010","2009","2008","2007"]
total_crime.insert(loc=0, column="Year", value=years)
total_crime.index=[0,1,2,3,4,5,6,7,8,9,10]

#gather data for officers in the US for each year
total_officers = pandas.DataFrame([officer_2017_fixed.iloc[1], officer_2016_fixed.iloc[1], officer_2015_fixed.iloc[1],
                                officer_2014_fixed.iloc[1], officer_2013_fixed.iloc[1], officer_2012_fixed.iloc[1], 
                                officer_2011_fixed.iloc[1], officer_2010_fixed.iloc[1], officer_2009_fixed.iloc[1], 
                                officer_2008_fixed.iloc[1], officer_2007_fixed.iloc[1]])

#change the index to default and merge the crime and officer tables
total_officers.index=[0,1,2,3,4,5,6,7,8,9,10]
total = pandas.concat([total_crime, total_officers], sort=False, axis=1)
total.name = "total"
total
Out[25]:
Year Population Violent Crime Violent Crime Rate per 100,000 Murder and Nonnegligent Manslaughter Muderer Rate Rape Revised Definition Rape (Revised) Rate Robbery Robbery Rate ... Larceny-Theft Rate Motor Vehicle Theft Motor Vehicle Theft Rate Total Group 1 Group 2 Group 3 Group 4 Group 5 Group 6
0 2017 325719178 1283220 394.0 17284 5.3 135755 41.7 319356 98.0 ... 1694.4 773139 237.4 2.2 2.6 1.6 1.6 1.7 1.9 3.8
1 2016 323127513 1283058 397.1 17250 5.3 130603 40.4 332198 102.8 ... 1745.0 765484 236.9 2.2 2.6 1.6 1.6 1.7 1.9 3.7
2 2015 321418820 1231566 383.2 15696 4.9 124047 38.6 327374 101.9 ... 1775.4 707758 220.2 2.1 2.5 1.7 1.6 1.7 1.8 3.7
3 2014 318857056 1197987 375.7 14249 4.5 116645 36.6 325802 102.2 ... 1837.3 689527 216.2 2.2 2.6 1.7 1.6 1.7 1.8 3.7
4 2013 316128839 1163146 367.9 14196 4.5 108612 34.4 345031 109.1 ... 1899.4 699594 221.3 2.2 2.6 1.7 1.6 1.7 1.8 3.5
5 2012 313914040 1214464 386.9 14827 4.7 84376 26.9 354522 112.9 ... 1959.3 721053 229.7 2.2 2.7 1.7 1.6 1.7 1.9 3.6
6 2011 311591917 1203564 386.3 14612 4.7 83425 26.8 354396 113.7 ... 1976.9 715373 229.6 2.2 2.7 1.8 1.7 1.7 1.9 3.5
7 2010 308745538 1246248 403.6 14748 4.8 84767 27.5 367832 119.1 ... 2003.5 737142 238.8 2.3 2.7 1.8 1.7 1.8 1.9 3.5
8 2009 307006550 1318398 429.4 15241 5.0 88097 28.7 408217 133.0 ... 2060.9 794616 258.8 2.3 2.8 1.8 1.7 1.8 1.9 3.5
9 2008 304059724 1382012 454.5 16272 5.4 89000 29.3 441855 145.3 ... 2167.0 956846 314.7 2.3 2.8 1.9 1.8 1.8 1.9 3.5
10 2007 301621157 1408337 466.9 16929 5.6 90427 30.0 445125 147.6 ... 2177.8 1095769 363.3 2.3 2.8 1.8 1.7 1.8 1.9 3.5

11 rows × 27 columns

Now we do this for the rest of the regions.

In [47]:
northeast = pandas.DataFrame([crime_2017_fixed.iloc[4], crime_2016_fixed.iloc[4], crime_2015_fixed.iloc[4],
                                crime_2014_fixed.iloc[4], crime_2013_fixed.iloc[4], crime_2012_fixed.iloc[4], 
                                crime_2011_fixed.iloc[4], crime_2010_fixed.iloc[4], crime_2009_fixed.iloc[4], 
                                crime_2008_fixed.iloc[4], crime_2007_fixed.iloc[4]])
years = ["2017","2016","2015","2014","2013","2012","2011","2010","2009","2008","2007"]
northeast.insert(loc=0, column="Year", value=years)
northeast.index=[0,1,2,3,4,5,6,7,8,9,10]

northeast_officers = pandas.DataFrame([officer_2017_fixed.iloc[3], officer_2016_fixed.iloc[3], officer_2015_fixed.iloc[3],
                                officer_2014_fixed.iloc[3], officer_2013_fixed.iloc[3], officer_2012_fixed.iloc[3], 
                                officer_2011_fixed.iloc[3], officer_2010_fixed.iloc[3], officer_2009_fixed.iloc[3], 
                                officer_2008_fixed.iloc[3], officer_2007_fixed.iloc[3]])
northeast_officers.index=[0,1,2,3,4,5,6,7,8,9,10]
northeast = pandas.concat([northeast, northeast_officers], sort=False, axis=1)
northeast.name = "Northeast"


new_england = pandas.DataFrame([crime_2017_fixed.iloc[7], crime_2016_fixed.iloc[7], crime_2015_fixed.iloc[7],
                                crime_2014_fixed.iloc[7], crime_2013_fixed.iloc[7], crime_2012_fixed.iloc[7], 
                                crime_2011_fixed.iloc[7], crime_2010_fixed.iloc[7], crime_2009_fixed.iloc[7], 
                                crime_2008_fixed.iloc[7], crime_2007_fixed.iloc[7]])
years = ["2017","2016","2015","2014","2013","2012","2011","2010","2009","2008","2007"]
new_england.insert(loc=0, column="Year", value=years)
new_england.index=[0,1,2,3,4,5,6,7,8,9,10]

new_england_officers = pandas.DataFrame([officer_2017_fixed.iloc[5], officer_2016_fixed.iloc[5], officer_2015_fixed.iloc[5],
                                officer_2014_fixed.iloc[5], officer_2013_fixed.iloc[5], officer_2012_fixed.iloc[5], 
                                officer_2011_fixed.iloc[5], officer_2010_fixed.iloc[5], officer_2009_fixed.iloc[5], 
                                officer_2008_fixed.iloc[5], officer_2007_fixed.iloc[5]])
new_england_officers.index=[0,1,2,3,4,5,6,7,8,9,10]
new_england = pandas.concat([new_england, new_england_officers], sort=False, axis=1)

new_england.name = "New England"


middle_atlantic = pandas.DataFrame([crime_2017_fixed.iloc[10], crime_2016_fixed.iloc[10], crime_2015_fixed.iloc[10],
                                crime_2014_fixed.iloc[10], crime_2013_fixed.iloc[10], crime_2012_fixed.iloc[10], 
                                crime_2011_fixed.iloc[10], crime_2010_fixed.iloc[10], crime_2009_fixed.iloc[10], 
                                crime_2008_fixed.iloc[10], crime_2007_fixed.iloc[10]])
years = ["2017","2016","2015","2014","2013","2012","2011","2010","2009","2008","2007"]
middle_atlantic.insert(loc=0, column="Year", value=years)
middle_atlantic.index=[0,1,2,3,4,5,6,7,8,9,10]

middle_atlantic_officers = pandas.DataFrame([officer_2017_fixed.iloc[7], officer_2016_fixed.iloc[7], officer_2015_fixed.iloc[7],
                                officer_2014_fixed.iloc[7], officer_2013_fixed.iloc[7], officer_2012_fixed.iloc[7], 
                                officer_2011_fixed.iloc[7], officer_2010_fixed.iloc[7], officer_2009_fixed.iloc[7], 
                                officer_2008_fixed.iloc[7], officer_2007_fixed.iloc[7]])
middle_atlantic_officers.index=[0,1,2,3,4,5,6,7,8,9,10]
middle_atlantic = pandas.concat([middle_atlantic, middle_atlantic_officers], sort=False, axis=1)

middle_atlantic.name = "Middle Atlantic"


midwest = pandas.DataFrame([crime_2017_fixed.iloc[13], crime_2016_fixed.iloc[13], crime_2015_fixed.iloc[13],
                                crime_2014_fixed.iloc[13], crime_2013_fixed.iloc[13], crime_2012_fixed.iloc[13], 
                                crime_2011_fixed.iloc[13], crime_2010_fixed.iloc[13], crime_2009_fixed.iloc[13], 
                                crime_2008_fixed.iloc[13], crime_2007_fixed.iloc[13]])
years = ["2017","2016","2015","2014","2013","2012","2011","2010","2009","2008","2007"]
midwest.insert(loc=0, column="Year", value=years)
midwest.index=[0,1,2,3,4,5,6,7,8,9,10]

midwest_officers = pandas.DataFrame([officer_2017_fixed.iloc[9], officer_2016_fixed.iloc[9], officer_2015_fixed.iloc[9],
                                officer_2014_fixed.iloc[9], officer_2013_fixed.iloc[9], officer_2012_fixed.iloc[9], 
                                officer_2011_fixed.iloc[9], officer_2010_fixed.iloc[9], officer_2009_fixed.iloc[9], 
                                officer_2008_fixed.iloc[9], officer_2007_fixed.iloc[9]])
midwest_officers.index=[0,1,2,3,4,5,6,7,8,9,10]
midwest = pandas.concat([midwest, midwest_officers], sort=False, axis=1)

midwest.name = "Midwest"


east_north_central = pandas.DataFrame([crime_2017_fixed.iloc[16], crime_2016_fixed.iloc[16], crime_2015_fixed.iloc[16],
                                crime_2014_fixed.iloc[16], crime_2013_fixed.iloc[16], crime_2012_fixed.iloc[16], 
                                crime_2011_fixed.iloc[16], crime_2010_fixed.iloc[16], crime_2009_fixed.iloc[16], 
                                crime_2008_fixed.iloc[16], crime_2007_fixed.iloc[16]])
years = ["2017","2016","2015","2014","2013","2012","2011","2010","2009","2008","2007"]
east_north_central.insert(loc=0, column="Year", value=years)
east_north_central.index=[0,1,2,3,4,5,6,7,8,9,10]

east_north_central_officers = pandas.DataFrame([officer_2017_fixed.iloc[11], officer_2016_fixed.iloc[11], officer_2015_fixed.iloc[11],
                                officer_2014_fixed.iloc[11], officer_2013_fixed.iloc[11], officer_2012_fixed.iloc[11], 
                                officer_2011_fixed.iloc[11], officer_2010_fixed.iloc[11], officer_2009_fixed.iloc[11], 
                                officer_2008_fixed.iloc[11], officer_2007_fixed.iloc[11]])
east_north_central_officers.index=[0,1,2,3,4,5,6,7,8,9,10]
east_north_central = pandas.concat([east_north_central, east_north_central_officers], sort=False, axis=1)

east_north_central.name = "East North Central"


west_north_central = pandas.DataFrame([crime_2017_fixed.iloc[19], crime_2016_fixed.iloc[19], crime_2015_fixed.iloc[19],
                                crime_2014_fixed.iloc[19], crime_2013_fixed.iloc[19], crime_2012_fixed.iloc[19], 
                                crime_2011_fixed.iloc[19], crime_2010_fixed.iloc[19], crime_2009_fixed.iloc[19], 
                                crime_2008_fixed.iloc[19], crime_2007_fixed.iloc[19]])
years = ["2017","2016","2015","2014","2013","2012","2011","2010","2009","2008","2007"]
west_north_central.insert(loc=0, column="Year", value=years)
west_north_central.index=[0,1,2,3,4,5,6,7,8,9,10]


west_north_central_officers = pandas.DataFrame([officer_2017_fixed.iloc[13], officer_2016_fixed.iloc[13], officer_2015_fixed.iloc[13],
                                officer_2014_fixed.iloc[13], officer_2013_fixed.iloc[13], officer_2012_fixed.iloc[13], 
                                officer_2011_fixed.iloc[13], officer_2010_fixed.iloc[13], officer_2009_fixed.iloc[13], 
                                officer_2008_fixed.iloc[13], officer_2007_fixed.iloc[13]])
west_north_central_officers.index=[0,1,2,3,4,5,6,7,8,9,10]
west_north_central = pandas.concat([west_north_central, west_north_central_officers], sort=False, axis=1)

west_north_central.name = "West North Central"


south = pandas.DataFrame([crime_2017_fixed.iloc[22], crime_2016_fixed.iloc[22], crime_2015_fixed.iloc[22],
                                crime_2014_fixed.iloc[22], crime_2013_fixed.iloc[22], crime_2012_fixed.iloc[22], 
                                crime_2011_fixed.iloc[22], crime_2010_fixed.iloc[22], crime_2009_fixed.iloc[22], 
                                crime_2008_fixed.iloc[22], crime_2007_fixed.iloc[22]])
years = ["2017","2016","2015","2014","2013","2012","2011","2010","2009","2008","2007"]
south.insert(loc=0, column="Year", value=years)
south.index=[0,1,2,3,4,5,6,7,8,9,10]

south_officers = pandas.DataFrame([officer_2017_fixed.iloc[15], officer_2016_fixed.iloc[15], officer_2015_fixed.iloc[15],
                                officer_2014_fixed.iloc[15], officer_2013_fixed.iloc[15], officer_2012_fixed.iloc[15], 
                                officer_2011_fixed.iloc[15], officer_2010_fixed.iloc[15], officer_2009_fixed.iloc[15], 
                                officer_2008_fixed.iloc[15], officer_2007_fixed.iloc[15]])
south_officers.index=[0,1,2,3,4,5,6,7,8,9,10]
south = pandas.concat([south, south_officers], sort=False, axis=1)

south.name = "South"


south_atlantic = pandas.DataFrame([crime_2017_fixed.iloc[25], crime_2016_fixed.iloc[25], crime_2015_fixed.iloc[25],
                                crime_2014_fixed.iloc[25], crime_2013_fixed.iloc[25], crime_2012_fixed.iloc[25], 
                                crime_2011_fixed.iloc[25], crime_2010_fixed.iloc[25], crime_2009_fixed.iloc[25], 
                                crime_2008_fixed.iloc[25], crime_2007_fixed.iloc[25]])
years = ["2017","2016","2015","2014","2013","2012","2011","2010","2009","2008","2007"]
south_atlantic.insert(loc=0, column="Year", value=years)
south_atlantic.index=[0,1,2,3,4,5,6,7,8,9,10]

south_atlantic_officers = pandas.DataFrame([officer_2017_fixed.iloc[17], officer_2016_fixed.iloc[17], officer_2015_fixed.iloc[17],
                                officer_2014_fixed.iloc[17], officer_2013_fixed.iloc[17], officer_2012_fixed.iloc[17], 
                                officer_2011_fixed.iloc[17], officer_2010_fixed.iloc[17], officer_2009_fixed.iloc[17], 
                                officer_2008_fixed.iloc[17], officer_2007_fixed.iloc[17]])
south_atlantic_officers.index=[0,1,2,3,4,5,6,7,8,9,10]
south_atlantic = pandas.concat([south_atlantic, south_atlantic_officers], sort=False, axis=1)

south_atlantic.name = "South Atlantic"


east_south_central = pandas.DataFrame([crime_2017_fixed.iloc[28], crime_2016_fixed.iloc[28], crime_2015_fixed.iloc[28],
                                crime_2014_fixed.iloc[28], crime_2013_fixed.iloc[28], crime_2012_fixed.iloc[28], 
                                crime_2011_fixed.iloc[28], crime_2010_fixed.iloc[28], crime_2009_fixed.iloc[28], 
                                crime_2008_fixed.iloc[28], crime_2007_fixed.iloc[28]])
years = ["2017","2016","2015","2014","2013","2012","2011","2010","2009","2008","2007"]
east_south_central.insert(loc=0, column="Year", value=years)
east_south_central.index=[0,1,2,3,4,5,6,7,8,9,10]

east_south_central_officers = pandas.DataFrame([officer_2017_fixed.iloc[19], officer_2016_fixed.iloc[19], officer_2015_fixed.iloc[19],
                                officer_2014_fixed.iloc[19], officer_2013_fixed.iloc[19], officer_2012_fixed.iloc[19], 
                                officer_2011_fixed.iloc[19], officer_2010_fixed.iloc[19], officer_2009_fixed.iloc[19], 
                                officer_2008_fixed.iloc[19], officer_2007_fixed.iloc[19]])
east_south_central_officers.index=[0,1,2,3,4,5,6,7,8,9,10]
east_south_central = pandas.concat([east_south_central, east_south_central_officers], sort=False, axis=1)

east_south_central.name = "East South Central"


west_south_central = pandas.DataFrame([crime_2017_fixed.iloc[31], crime_2016_fixed.iloc[31], crime_2015_fixed.iloc[31],
                                crime_2014_fixed.iloc[31], crime_2013_fixed.iloc[31], crime_2012_fixed.iloc[31], 
                                crime_2011_fixed.iloc[31], crime_2010_fixed.iloc[31], crime_2009_fixed.iloc[31], 
                                crime_2008_fixed.iloc[31], crime_2007_fixed.iloc[31]])
years = ["2017","2016","2015","2014","2013","2012","2011","2010","2009","2008","2007"]
west_south_central.insert(loc=0, column="Year", value=years)
west_south_central.index=[0,1,2,3,4,5,6,7,8,9,10]

west_south_central_officers = pandas.DataFrame([officer_2017_fixed.iloc[21], officer_2016_fixed.iloc[21], officer_2015_fixed.iloc[21],
                                officer_2014_fixed.iloc[21], officer_2013_fixed.iloc[21], officer_2012_fixed.iloc[21], 
                                officer_2011_fixed.iloc[21], officer_2010_fixed.iloc[21], officer_2009_fixed.iloc[21], 
                                officer_2008_fixed.iloc[21], officer_2007_fixed.iloc[21]])
west_south_central_officers.index=[0,1,2,3,4,5,6,7,8,9,10]
west_south_central = pandas.concat([west_south_central, west_south_central_officers], sort=False, axis=1)

west_south_central.name = "West South Central"


west = pandas.DataFrame([crime_2017_fixed.iloc[34], crime_2016_fixed.iloc[34], crime_2015_fixed.iloc[34],
                                crime_2014_fixed.iloc[34], crime_2013_fixed.iloc[34], crime_2012_fixed.iloc[34], 
                                crime_2011_fixed.iloc[34], crime_2010_fixed.iloc[34], crime_2009_fixed.iloc[34], 
                                crime_2008_fixed.iloc[34], crime_2007_fixed.iloc[34]])
years = ["2017","2016","2015","2014","2013","2012","2011","2010","2009","2008","2007"]
west.insert(loc=0, column="Year", value=years)
west.index=[0,1,2,3,4,5,6,7,8,9,10]

west_officers = pandas.DataFrame([officer_2017_fixed.iloc[23], officer_2016_fixed.iloc[23], officer_2015_fixed.iloc[23],
                                officer_2014_fixed.iloc[23], officer_2013_fixed.iloc[23], officer_2012_fixed.iloc[23], 
                                officer_2011_fixed.iloc[23], officer_2010_fixed.iloc[23], officer_2009_fixed.iloc[23], 
                                officer_2008_fixed.iloc[23], officer_2007_fixed.iloc[23]])
west_officers.index=[0,1,2,3,4,5,6,7,8,9,10]
west = pandas.concat([west, west_officers], sort=False, axis=1)

west.name = "West"


mountain = pandas.DataFrame([crime_2017_fixed.iloc[37], crime_2016_fixed.iloc[37], crime_2015_fixed.iloc[37],
                                crime_2014_fixed.iloc[37], crime_2013_fixed.iloc[37], crime_2012_fixed.iloc[37], 
                                crime_2011_fixed.iloc[37], crime_2010_fixed.iloc[37], crime_2009_fixed.iloc[37], 
                                crime_2008_fixed.iloc[37], crime_2007_fixed.iloc[37]])
years = ["2017","2016","2015","2014","2013","2012","2011","2010","2009","2008","2007"]
mountain.insert(loc=0, column="Year", value=years)
mountain.index=[0,1,2,3,4,5,6,7,8,9,10]

mountain_officers = pandas.DataFrame([officer_2017_fixed.iloc[25], officer_2016_fixed.iloc[25], officer_2015_fixed.iloc[25],
                                officer_2014_fixed.iloc[25], officer_2013_fixed.iloc[25], officer_2012_fixed.iloc[25], 
                                officer_2011_fixed.iloc[25], officer_2010_fixed.iloc[25], officer_2009_fixed.iloc[25], 
                                officer_2008_fixed.iloc[25], officer_2007_fixed.iloc[25]])
mountain_officers.index=[0,1,2,3,4,5,6,7,8,9,10]
mountain = pandas.concat([mountain, mountain_officers], sort=False, axis=1)

mountain.name = "Mountain"


pacific = pandas.DataFrame([crime_2017_fixed.iloc[40], crime_2016_fixed.iloc[40], crime_2015_fixed.iloc[40],
                                crime_2014_fixed.iloc[40], crime_2013_fixed.iloc[40], crime_2012_fixed.iloc[40], 
                                crime_2011_fixed.iloc[40], crime_2010_fixed.iloc[40], crime_2009_fixed.iloc[40], 
                                crime_2008_fixed.iloc[40], crime_2007_fixed.iloc[40]])
years = ["2017","2016","2015","2014","2013","2012","2011","2010","2009","2008","2007"]
pacific.insert(loc=0, column="Year", value=years)
pacific.index=[0,1,2,3,4,5,6,7,8,9,10]

pacific_officers = pandas.DataFrame([officer_2017_fixed.iloc[27], officer_2016_fixed.iloc[27], officer_2015_fixed.iloc[27],
                                officer_2014_fixed.iloc[27], officer_2013_fixed.iloc[27], officer_2012_fixed.iloc[27], 
                                officer_2011_fixed.iloc[27], officer_2010_fixed.iloc[27], officer_2009_fixed.iloc[27], 
                                officer_2008_fixed.iloc[27], officer_2007_fixed.iloc[27]])
pacific_officers.index=[0,1,2,3,4,5,6,7,8,9,10]
pacific = pandas.concat([pacific, pacific_officers], sort=False, axis=1)

pacific.name = "Pacific"

Visualization

In this stage of the life cycle we plot the data to observe potential trends and see what would be useful to analyze. Let's start by plotting the number of violent crimes over time for each region.

In [49]:
plt.rcParams['figure.figsize'] = [10,10]

regions = [total, northeast, new_england, middle_atlantic, midwest, east_north_central, west_north_central, 
           south, south_atlantic, east_south_central, west_south_central, west, mountain, pacific]


for region in regions:
    plt.plot(region["Year"], region["Violent Crime"], label=region.name)
    plt.legend(bbox_to_anchor=(1.05, 1), loc=2, borderaxespad=0.)


plt.gca().invert_xaxis()
plt.title("Violent Crime over Time")
plt.show()
plt.close()

After plotting the data we can see that over time, crime is decreasing. However, different regions have different populations, so it may be more useful to look at crime rate over time instead of the raw crime numbers.

In [43]:
for region in regions:
    plt.plot(region["Year"], region["Violent Crime Rate per 100,000"], label=region.name)
    
    plt.legend(bbox_to_anchor=(1.05, 1), loc=2, borderaxespad=0.)
    #print(region["Year"])

    
plt.gca().invert_xaxis()

plt.show()
plt.close()

As we can see, looking at crime rate is more informative compared to looking at the raw numbers. Crime rate is still decreasing over time, but now we have a better visual for how things are changing over time.

Now let's plot the rates for different types of crime over time for each region.

In [45]:
crimes = ["Violent Crime Rate per 100,000", "Muderer Rate", "Rape (Revised) Rate", "Robbery Rate", 
          "Aggravated Assault Rate", "Property Crime Rate", "Burglary Rate", "Larceny-Theft Rate",
          "Motor Vehicle Theft Rate"]

for i in range(14):
    for crime in crimes:
        #axes[i].
        plt.plot(regions[i]["Year"], regions[i][crime],label=crime)
        plt.title(regions[i].name)
        
    plt.legend(bbox_to_anchor=(1.05, 1), loc=2, borderaxespad=0.)
    plt.gca().invert_xaxis()
    plt.show()
    plt.close()

    

plt.show()
plt.close()
    

As we can see from these plots, for all the regions the most common crime is property crime. For most regions crime is either decreasing or staying constant. However, in some regions, like the midwest and east north central, the violent crime rate is starting to increase. In addition, in the mountain and pacific regions the motor vehicle theft rate is also starting to increase.

It may also be useful to plot by crime instead of by region.

In [46]:
for crime in crimes:
    for region in regions:
        plt.plot(region["Year"], region[crime], label=region.name)
        plt.title(crime)
    
    plt.gca().invert_xaxis()
    plt.legend(bbox_to_anchor=(1.05, 1), loc=2, borderaxespad=0.)
    plt.show()
    plt.close()
    
    
plt.show()
plt.close()

These plots confirm what we had said previously; in some regions the rate of violent crime is increasing. These plots also show that where we were wrong when we previously though that crime was decreasing or staying constant. Some crimes, such as robbery, property crime, burglary, and larceny-theft are decreasing. However, some of the others that looked constant in the region plots are actually increasing, such as rape and murder. In addition, violent crime, aggravated assault, and motor vehicle theft were decreasing over time, but in the past few years have started to increase.

Next let's look at the rate of number of officers in a region over time, to see if there has been an increase in the number of officers.

In [50]:
for region in regions:
    plt.plot(region["Year"], region["Total"], label=region.name)

plt.gca().invert_xaxis()
plt.legend(bbox_to_anchor=(1.05, 1), loc=2, borderaxespad=0.)
plt.show()
plt.close()

From this plot we can see that overall the number of officers is decreasing slightly over time. In addition, some regions have a lot more officers per 1,000 inhabitants than others. For example, in 2007 the South Atlantic had 3 officers per 1,000 whereas the Pacific only had 1.7.

Analysis

In this step of the data cycle we apply machine learning and statistics in order to uncover trends and relationships in the data. We will be using linear regression to obtain a predictive model of the data, which can be used to extrapolate.

After plotting the data, we want to see if there is a relationship between crime rate and the number of officers. To do this we will plot the number of officers per 1,000 inhabitants vs the crime rate for each crime. Then we will use linear regression to find a line of best fit for the data.

In [72]:
for crime in crimes:
    x = []
    y = []
    for region in regions:
        x.append(region["Total"].values)
        y.append(region[crime].values)
        
        
    x = np.array(x)
    x = x.flatten()
    y = np.array(y)
    y = y.flatten()
    
    
    plt.scatter(x, y)
    plt.title(crime)
        
    line = linear_model.LinearRegression()
        
    line.fit(x.reshape(-1,1),y)
    predicted = line.predict(x.reshape(-1,1))
        
    r_value = line.score(x.reshape(-1,1),y)
        
        
    plt.plot(x,predicted, c='r', label=r_value)
    plt.legend()
    
    plt.show()
    plt.close()

The r-values for each crime is less than .5, so there is not a strong linear relationship between number of officers and crime rate. However, the data is very scattered so there may be a relationship if we look at the data per crime and per region. Let's plot each region and crime individually to see if this is the case.

In [73]:
for region in regions:
    for crime in crimes:
        plt.scatter(region["Total"], region[crime], label=crime)
        plt.title(region.name + " " + crime)
        
        line = linear_model.LinearRegression()
        
        line.fit(region["Total"].values.reshape(-1,1),region[crime])
        predicted = line.predict(region["Total"].values.reshape(-1,1))
        
        r_value = line.score(region["Total"].values.reshape(-1,1),region[crime])
        
        
        plt.plot(region["Total"],predicted, c='r', label=r_value)
        plt.legend()
        plt.show()
        plt.close()

From these plots, we can see that some pairs of region and crime rate do have a stronger linear relationship. For example, in the west north central region, the r-value for rape is -.87, which indicates a strong linear relationship. Another example is in the same region the r-value for property crime is .72, for burglary it is .83, and for larceny-theft it is .72. In the South the r-value for rape is -.64 and the r-value for property crime is .63, which indicates a moderate linear relationship. In addition, in the South the r-value for burglary is .72. In the south atlantic the r-value for rape is -.61, for robbery it is .58, for property crime it is .85, burglary is .94, larceny-theft is .84. In the West the r-value for robbery is .74 and in the mountain region the r-value for robbery rate is .67. In the pacific region the r-value for violent crime is .61, the r-value for murder is .55, for robbery it's .84, and for property crime it is .55. For the rest of the regions and crimes, there is no linear relationship.

From this we can see that the number of officers does not uniformly effect crime per region, or region per crime. That is, whether there is a correlation between number of officers and crime rate depends on both the region and the crime, not just region and not just the crime. Furthermore, in most cases there is a positive correlation between number of officers and crime rate! For example, in the pacific region, aggravated assault has a positive correlation with the number of officers. The only crime that has a negative correlation is rape.

Insight

In the final step of the data cycle, we draw conclusions based off our analysis to inform decisions made based on the data.

In this case, we conclude that there is no relationship between number of officers and crime rate. When analyzing the data by crime and region, most regions and crimes had no linear relationship. Of the few that did, there was actually a positive correlation, indicating that with more officers there was more crime. However, not all crimes with a correlation had a positive one, since rape, when it was correlated, had a negative correlation.

Based on this further analysis would need to be done before making any decisions. Even if there is some positive correlation between number of officers and crime rate, cities should not reduce the number of officers. There is more to the relationship between number of officers and crime rate than is apparent, and further analysis could shed light on this. Some things to look into are gender ratio of officers, crime rates per city instead of per region, and officer salaries. In addition, more advanced machine learning techniques could be applied to the data, such as Gaussian Mixture Models. To learn more about machine learning visit https://medium.com/machine-learning-for-humans/why-machine-learning-matters-6164faf1df12.

In [ ]: